

Search for: All records

Creators/Authors contains: "Finlayson, Mark A"

Note: Clicking a Digital Object Identifier (DOI) takes you to an external site maintained by the publisher. Some full-text articles may not be available free of charge during the publisher's embargo period.

Some links on this page may take you to non-federal websites. Their policies may differ from those of this site.

  1. Patent landscaping is the process of identifying all patents related to a particular technological area, and is important for assessing various aspects of the intellectual property context. Traditionally, constructing patent landscapes is intensely laborious and expensive, and the rapid expansion of patenting activity in recent decades has driven an increasing need for efficient and effective automated patent landscaping approaches. In particular, it is critical that we be able to construct patent landscapes using a minimal number of labeled examples, as labeling patents for a narrow technology area requires highly specialized (and hence expensive) technical knowledge. We present an automated neural patent landscaping system that demonstrates significantly improved performance on difficult examples (0.69 on “hard” examples, versus 0.6 for previously reported systems), and also significant improvements with much less training data (0.75 overall with as few as 24 labeled examples). Furthermore, in evaluating such automated landscaping systems, acquiring good data is a challenge; we demonstrate a higher-quality training data generation procedure that merges the “seed/anti-seed” approach of Abood and Feltenberger (Artif Intell Law 26:103–125, 2018) with active learning to collect difficult labeled examples near the decision boundary. Using this procedure we created a new dataset of labeled AI patents for training and testing. As in prior work we compare our approach with a number of baseline systems, and we release our code and data for others to build upon. (Code and data may be downloaded from https://doi.org/10.34703/gzx1-9v95/QDLKVW and are released under the Creative Commons BY-NC 4.0 license at https://creativecommons.org/licenses/by-nc/4.0/.)
    Free, publicly-accessible full text available October 4, 2026
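The selection of hard examples near the decision boundary can be sketched in a few lines. Below is a minimal, hypothetical illustration of uncertainty sampling, with a logistic-regression stand-in for the paper's neural classifier; the embeddings, labels, and pool sizes are placeholder data, not the released dataset.

```python
# Hypothetical sketch of active learning via uncertainty sampling: pick the
# unlabeled patents whose predicted probability is closest to 0.5, i.e. the
# documents nearest the decision boundary, and send those to human labelers.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(24, 64))      # placeholder patent embeddings
y_labeled = np.array([0, 1] * 12)          # 1 = in landscape, 0 = not
X_pool = rng.normal(size=(1000, 64))       # unlabeled candidate pool

clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
p_in_landscape = clf.predict_proba(X_pool)[:, 1]

# The k pool documents with predictions nearest 0.5 are the "hard" examples.
k = 10
hard_idx = np.argsort(np.abs(p_in_landscape - 0.5))[:k]
print("hardest examples to label next:", hard_idx)
```

In a real loop, the newly labeled examples would be added to the training set and the model retrained before the next round of selection.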
  2. The 2023 update to the Artificial Intelligence Patent Dataset (AIPD) extends the original AIPD to all United States Patent and Trademark Office (USPTO) patent documents (i.e., patents and pre-grant publications, or PGPubs) published through 2023, while incorporating an improved patent landscaping methodology to identify AI within patents and PGPubs. This new approach substitutes BERT for Patents for the Word2Vec embeddings used previously, and uses active learning to incorporate additional training data closer to the “decision boundary” between AI and not-AI to help improve predictions. We show that this new approach achieves substantially better performance than the original methodology on a set of patent documents where the two methods disagreed—on this set, the AIPD 2023 achieved precision of 68.18 percent and recall of 78.95 percent, while the original AIPD achieved 50 percent and 21.05 percent, respectively. To help researchers, practitioners, and policy-makers better understand the determinants and impacts of AI invention, we have made the AIPD 2023 publicly available on the USPTO’s economic research web page. 
    Free, publicly-accessible full text available February 22, 2026
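The key methodological substitution, Word2Vec replaced by BERT for Patents, amounts to encoding each patent document with a transformer and pooling its token states into one vector. The sketch below is an assumption-laden illustration; the Hugging Face model id, the mean pooling, and the truncation settings are my choices, not the AIPD 2023 pipeline.

```python
# A minimal sketch of document embedding with a BERT-for-Patents checkpoint.
# "anferico/bert-for-patents" is a community-hosted conversion of Google's
# BERT for Patents; treat the model id and pooling choice as assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("anferico/bert-for-patents")
model = AutoModel.from_pretrained("anferico/bert-for-patents")

def embed(texts):
    """Mean-pool the final hidden states into one vector per document."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)    # zero out padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(embed(["A neural network method for classifying patent claims."]).shape)
```

These vectors would then feed the same kind of boundary-focused active-learning loop described in the first abstract above.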
  3. We introduce the task of story fragment stitching, which is the process of automatically aligning and merging event sequences of partial tellings of a story (i.e., story fragments). We assume that each fragment contains at least one event from the story of interest, and that every fragment shares at least one event with another fragment. We propose a graph-based unsupervised approach to solving this problem in which event mentions are represented as nodes in the graph, and the graph is compressed using a variant of model merging to combine nodes. The goal is for each node in the final graph to contain only coreferent event mentions. To find coreferent events, we use BERT contextualized embeddings in conjunction with a tf-idf vector representation. Constraints on the merge compression preserve the overall timeline of the story, and the final graph represents the full story timeline. We evaluate our approach using a new annotated corpus of the partial tellings of the story of Moses found in the Quran, which we release for public use. Our approach achieves an F1 score of 0.63.
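To make the graph-merging idea concrete, here is a toy sketch in which tf-idf cosine similarity stands in for the paper's combined BERT + tf-idf representation, and a simple ordering check stands in for the merge constraints that preserve the story timeline. The fragments, threshold, and greedy alignment are all illustrative assumptions.

```python
# Toy story fragment stitching: event mentions are nodes; a mention from
# fragment B merges with its most similar mention in fragment A only if the
# merge keeps both fragments' event orders intact.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

frag_a = ["Moses is born", "Moses flees to Midian", "Moses returns to Egypt"]
frag_b = ["the infant Moses is born", "Moses comes back to Egypt"]

vecs = TfidfVectorizer().fit_transform(frag_a + frag_b)
sim = cosine_similarity(vecs)

merged, last_match = {}, -1
for j, mention_b in enumerate(frag_b):
    scores = sim[len(frag_a) + j, : len(frag_a)]
    i = int(scores.argmax())
    if scores[i] > 0.2 and i > last_match:  # similarity + timeline constraint
        merged[mention_b] = frag_a[i]       # treat the two nodes as coreferent
        last_match = i

print(merged)
```

The full method compresses a graph over many fragments rather than aligning pairs, but the same two ingredients, a similarity test and an order-preserving constraint, drive each merge.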
  4. Animacy is the characteristic of a referent being able to independently carry out actions in a story world (e.g., movement, communication). It is a necessary property of characters in stories, and so detecting animacy is an important step in automatic story understanding; it is also potentially useful for many other natural language processing tasks such as word sense disambiguation, coreference resolution, character identification, and semantic role labeling. Recent work by Jahan et al. [2018] demonstrated a new approach to detecting animacy in which animacy is considered a direct property of coreference chains (and referring expressions) rather than words. Jahan et al. combined hand-built rules and machine learning (ML) to identify the animacy of referring expressions and used majority voting to assign the animacy of coreference chains, reporting high performance of up to 0.90 F1. In this short report we verify that the approach generalizes to two different corpora (OntoNotes and the Corpus of English Novels) and confirm that the hybrid model performs best, with the rule-based model in second place. Our tests apply the animacy classifier to almost twice as much data as Jahan et al.'s initial study. Our results also strongly suggest, as would be expected, the dependence of the models on coreference chain quality. We release our data and code to enable reproducibility.
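The chain-level voting scheme the report builds on is straightforward to state in code. In this sketch the per-expression classifier is a stub (Jahan et al. combine hand-built rules with supervised ML for that step), and the example chain is invented:

```python
# Majority voting over a coreference chain: each referring expression casts
# an animacy vote, and the chain takes the majority label.
from collections import Counter

def expression_is_animate(expr: str) -> bool:
    """Stub for the per-expression hybrid (rules + ML) classifier."""
    animate_cues = {"he", "she", "who", "the old woman"}
    return expr.lower() in animate_cues

def chain_is_animate(chain) -> bool:
    votes = Counter(expression_is_animate(e) for e in chain)
    return votes[True] >= votes[False]

# A noisy chain where one mention was misclassified still comes out animate.
print(chain_is_animate(["the old woman", "she", "it"]))  # -> True (2 of 3)
```

This also makes the report's caveat visible: if the upstream coreference system builds bad chains, the votes are polluted, which is why model performance tracks chain quality.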
  5. Recognizing the internal structure of events is a challenging language processing task of great importance for text understanding. We present a supervised model for automatically identifying when one event is a subevent of another. Building on prior work, we introduce several novel features, in particular discourse and narrative features, that significantly improve upon prior state-of-the-art performance. Error analysis further demonstrates the utility of these features. We evaluate our model on the only two annotated corpora with event hierarchies: HiEve and the Intelligence Community corpus. No prior system has been evaluated on both corpora. Our model outperforms previous systems on both corpora, achieving 0.74 BLANC F1 on the Intelligence Community corpus and 0.70 F1 on the HiEve corpus, respectively a 15 and 5 percentage point improvement over previous models. 
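As a rough illustration of the setup, each ordered event pair becomes a feature vector and a binary classifier decides the subevent relation. The feature names below (a discourse relation plus two structural cues) are invented stand-ins for the paper's actual feature set:

```python
# Sketch of supervised subevent detection: featurize an (event1, event2)
# pair and predict whether event2 is a subevent of event1.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

pairs = [
    {"same_sentence": 1, "shared_participants": 1, "discourse": "Elaboration"},
    {"same_sentence": 0, "shared_participants": 0, "discourse": "Narration"},
    {"same_sentence": 1, "shared_participants": 1, "discourse": "Elaboration"},
    {"same_sentence": 0, "shared_participants": 1, "discourse": "Narration"},
]
labels = [1, 0, 1, 0]  # 1 = subevent relation holds

vec = DictVectorizer()                     # one-hot encodes string features
clf = LogisticRegression().fit(vec.fit_transform(pairs), labels)

test = {"same_sentence": 1, "shared_participants": 1, "discourse": "Elaboration"}
print(clf.predict(vec.transform([test])))  # -> [1]
```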
  6. Animacy is a necessary property for a referent to be an agent, and thus animacy detection is useful for a variety of natural language processing tasks, including word sense disambiguation, coreference resolution, semantic role labeling, and others. Prior work treated animacy as a word-level property and developed statistical classifiers to classify words as either animate or inanimate. We discuss why this approach to the problem is ill-posed, and present a new approach based on classifying the animacy of coreference chains. We show that simple voting approaches to inferring the animacy of a chain from its constituent words perform relatively poorly, and then present a hybrid system merging supervised machine learning (ML) and a small number of hand-built rules to compute the animacy of referring expressions and coreference chains. This method achieves state-of-the-art performance. The supervised ML component leverages features such as word embeddings over referring expressions, parts of speech, and grammatical and semantic roles. The rules take into consideration parts of speech and the hypernymy structure encoded in WordNet. The system achieves an F1 of 0.88 for classifying the animacy of referring expressions, which is comparable to state-of-the-art results for classifying the animacy of words, and achieves an F1 of 0.75 for classifying the animacy of coreference chains themselves. We release our training and test dataset, which includes 142 texts (all narratives) comprising 156,154 words, 34,698 referring expressions, and 10,941 coreference chains. We test the method on a subset of the OntoNotes dataset, showing via manual sampling that animacy classification is 90% +/- 2% accurate for coreference chains, and 92% +/- 1% for referring expressions. The data also contains 46 folktales, which present an interesting challenge because they often involve characters who are members of traditionally inanimate classes (e.g., stoves that walk, trees that talk). We show that our system is able to detect the animacy of these unusual referents with an F1 of 0.95.
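The WordNet side of the hybrid system can be illustrated with a small hypernymy rule: call a head noun animate if any of its noun senses has person.n.01 or animal.n.01 among its hypernyms. This is one plausible reading of the rule component, not the paper's exact rule set:

```python
# Hypernymy-based animacy rule over WordNet (requires: pip install nltk,
# then nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

ANIMATE_ROOTS = {wn.synset("person.n.01"), wn.synset("animal.n.01")}

def head_noun_is_animate(noun: str) -> bool:
    for sense in wn.synsets(noun, pos=wn.NOUN):
        if set(sense.closure(lambda s: s.hypernyms())) & ANIMATE_ROOTS:
            return True
    return False

for word in ["farmer", "wolf", "stove"]:
    print(word, head_noun_is_animate(word))   # farmer/wolf True, stove False
```

The folktale referents mentioned in the abstract (walking stoves, talking trees) are exactly the cases such a rule misses, which is why the ML component and chain-level classification matter.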